3 Scientific Progress and Accomplishments

3.1  Introduction  and  Overview 

Problems  involving  massive  amounts  of  data  arise  naturally  in  a  variety  of  disciplines,  such  as 
spatial  databases,  geographic  information  systems,  text  repositories,  string  databases,  constraint 
logic  programming,  object-oriented  databases,  statistics,  virtual  reality  systems,  and  computer 
graphics.  NASA’s  Earth  Observing  System  project,  the  core  part  of  the  Earth  Science  Enterprise 
(formerly Mission to Planet Earth), produces petabytes ($10^{15}$ bytes) of raster data per year! A
major  challenge  is  to  develop  mechanisms  for  processing  the  data  efficiently,  or  else  much  of  it  will 
be  useless. 

The bottleneck in many applications that process massive amounts of data is the I/O communication
between internal memory and external memory. The bottleneck is accentuated as processors
get  faster  and  parallel  processors  are  used.  Parallel  disk  arrays  are  often  used  to  increase  the  I/O 
bandwidth.  The  goal  of  this  proposal  is  to  deepen  our  understanding  of  the  limits  of  I/O  systems 
and  to  construct  external  memory  algorithms  that  are  provably  efficient.  The  three  measures  of 
performance  are  number  of  I/Os,  disk  storage  space,  and  CPU  time.  Even  when  the  data  fit  entirely 
in  memory,  communication  can  still  be  the  bottleneck,  and  the  related  issues  of  caching  become 
important. 

Theoretical work involves development and analysis of provably efficient external memory algorithms
and cache-efficient algorithms for a variety of important application areas. In [5], we give
a  broad  survey  of  the  state  of  the  art  in  the  design  and  analysis  of  external  memory  algorithms 
and  data  structures.  We  address  several  batched  and  on-line  problems,  involving  text  databases, 
prefetching and streaming data from parallel disks, and database selectivity estimation. Our experimental
validation uses our TPIE programming environment.

3.2  Research  Results 
3.2.1  Parallel  Disks 

Technology trends indicate that it is of prime importance to develop techniques that use multiple
disks effectively in parallel to speed up external sorting. In [2], we look at
the simple randomized merging (SRM) mergesort algorithm that we earlier showed to be the first
parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast
in practice. Knuth (in the new edition of The Art of Computer Programming, Vol. 3: Sorting and
Searching)  recently  identified  SRM  (which  he  calls  “randomized  striping”)  as  the  method  of  choice 
for  sorting  with  parallel  disks.  In  [2],  we  present  an  efficient  implementation  of  SRM,  based  upon 
novel  data  structures.  We  give  a  new  implementation  for  SRM’s  lookahead  forecasting  technique 
for  parallel  prefetching  and  its  forecast  and  flush  technique  for  buffer  management.  Our  techniques 
amount to a significant improvement in the way SRM carries out the parallel, independent disk
accesses  necessary  to  efficiently  read  blocks  of  input  runs  during  external  merging. 
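
The forecasting idea behind such prefetching can be illustrated with a minimal Python sketch. It shows only the classical forecasting rule for merging, in which the run whose current block has the smallest last key is the next one to need a new block. It is not the data structure of [2]; the function name and the in-memory representation of runs as lists of sorted blocks are assumptions made for illustration.

```python
import heapq

def forecast_fetch_order(runs):
    """Return the order in which a merge needs the non-initial blocks.

    `runs` is a list of sorted runs; each run is a list of blocks, and each
    block is a non-empty sorted list of keys (hypothetical in-memory layout).
    The result is a list of (run_index, block_index) pairs.
    """
    # One heap entry per run: the last key of the block currently buffered.
    heap = [(blocks[0][-1], r, 0) for r, blocks in enumerate(runs) if blocks]
    heapq.heapify(heap)
    order = []
    while heap:
        # The buffered block with the smallest last key is exhausted first.
        _, r, b = heapq.heappop(heap)
        nxt = b + 1
        if nxt < len(runs[r]):
            order.append((r, nxt))  # this block must be prefetched next
            heapq.heappush(heap, (runs[r][nxt][-1], r, nxt))
    return order
```

For the two runs `[[1, 4], [9, 12]]` and `[[2, 3], [5, 6], [20, 21]]`, the merge exhausts block `[2, 3]` first, so the sketch schedules block 1 of run 1, then block 1 of run 0, then block 2 of run 1.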

We present the performance of SRM over a wide range of input sizes and compare its performance
with that of disk-striped mergesort (DSM), the commonly used technique to implement
external  mergesort  on  D  parallel  disks.  DSM  consists  of  using  a  standard  mergesort  algorithm 
in  conjunction  with  striped  I/O  for  parallel  disk  access.  SRM  merges  together  significantly  more 
runs  at  a  time  compared  with  DSM,  and  thus  it  requires  fewer  merge  passes.  We  demonstrate  in 
practical  scenarios  that  even  though  the  streaming  speeds  for  merging  with  DSM  are  a  little  higher 
than  those  for  SRM  (since  DSM  merges  fewer  runs  at  a  time),  sorting  using  SRM  is  significantly 
faster  than  with  DSM,  since  SRM  requires  fewer  passes. 
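
The effect of merge order on the number of passes can be seen in a small Python calculation. The two fan-in formulas are simplified assumptions, not the exact buffer accounting of [2]: DSM is modeled as treating D physical blocks as one logical block with one logical block reserved for output, and the SRM-like figure simply reserves 2D blocks for output and prefetching.

```python
def merge_passes(num_runs: int, fan_in: int) -> int:
    """Number of merge passes needed to reduce num_runs sorted runs to one."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // fan_in)  # ceiling division
        passes += 1
    return passes

def dsm_fan_in(memory_blocks: int, disks: int) -> int:
    # Assumption: striping turns D physical blocks into one logical block,
    # and one logical block is reserved for the output run.
    return memory_blocks // disks - 1

def srm_like_fan_in(memory_blocks: int, disks: int) -> int:
    # Assumption: independent disk access keeps one physical block per run,
    # with 2*D blocks set aside for output and prefetching.
    return memory_blocks - 2 * disks
```

With 1000 memory blocks, 10 disks, and 10,000 initial runs, these assumptions give a fan-in of 99 and three passes for DSM, versus a fan-in of 980 and two passes for the SRM-like scheme.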

The  techniques  in  this  paper  can  be  generalized  to  meet  the  load-balancing  requirements  of 
other  applications  using  parallel  disks,  including  distribution  sort,  multiway  partitioning  of  a  file 
into  several  other  files,  and  some  potential  multimedia  streaming  applications. 

3.2.2  XML  Databases 

The  extensible  mark-up  language  (XML)  is  gaining  widespread  use  as  a  format  for  data  exchange 
and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation
of path expressions to optimize query execution plans. Selectivity estimation of XML path
expressions is usually based on summary statistics about the structure of the underlying XML
repository.  All  previous  methods  require  an  off-line  scan  of  the  XML  repository  to  collect  the 
statistics. 

In  [6],  we  propose  XPathLearner,  a  method  for  estimating  selectivity  of  the  most  commonly 
used types of path expressions without looking at the XML data. XPathLearner gathers and refines
the statistics using query feedback in an on-line manner and is especially suited to queries in
Internet-scale applications, since the underlying XML repositories are likely to be inaccessible or
too large to be scanned entirely. Besides the on-line property, our method has two other novel
features: (a) XPathLearner is workload-aware in collecting the statistics and thus can be dramatically
more accurate than the more costly off-line method under tight memory constraints, and (b)
XPathLearner  automatically  adjusts  the  statistics  using  query  feedback  when  the  underlying  XML 
data  change.  We  show  empirically  the  estimation  accuracy  of  our  method  using  several  real  data 
sets. 
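
The general idea of refining statistics from query feedback under a memory budget can be sketched in Python. This is a toy estimator, not the summary structure used by XPathLearner in [6]: the class name, the per-path table, the learning rate, and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict

class FeedbackSelectivityEstimator:
    """Toy on-line estimator: one estimate per path, refined by feedback."""

    def __init__(self, capacity=1000, learning_rate=0.5, default=0.0):
        self.capacity = capacity            # memory budget, in table entries
        self.learning_rate = learning_rate  # how strongly feedback corrects
        self.default = default              # estimate for unseen paths
        self._stats = OrderedDict()         # path -> estimate, in LRU order

    def estimate(self, path):
        if path in self._stats:
            self._stats.move_to_end(path)   # mark as recently used
            return self._stats[path]
        return self.default

    def feedback(self, path, actual):
        """Move the estimate toward the true result size of an executed query."""
        old = self._stats.get(path, self.default)
        self._stats[path] = old + self.learning_rate * (actual - old)
        self._stats.move_to_end(path)
        # Workload awareness in this toy: keep only recently queried paths.
        while len(self._stats) > self.capacity:
            self._stats.popitem(last=False)
```

Because the table is updated only from executed queries, the statistics never require a scan of the repository, and repeated feedback lets the estimates track changes in the data.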

3.2.3  Streaming  Algorithms 

In  [1],  we  investigate  the  problem  of  smoothing  multiplexed  network  traffic,  when  either  a  streaming 
server  transmits  data  to  multiple  clients,  or  a  server  accesses  data  from  multiple  storage  devices 
or  other  servers.  We  introduce  efficient  algorithms  for  lexicographically  optimally  smoothing  the 
aggregate  bandwidth  requirements  over  a  shared  network  link.  In  the  data  transmission  problem,  we 
consider  the  case  in  which  the  clients  have  different  buffer  capacities  but  no  bandwidth  constraints, 
or  no  buffer  capacities  but  different  bandwidth  constraints.  For  the  data  access  problem,  we  handle 
the  general  case  of  a  shared  buffer  capacity  and  individual  network  bandwidth  constraints.  Previous 
approaches in the literature for the data access problem either handled only the case of a single
stream or did not compute the lexicographically optimal schedule.

Lexicographically optimal smoothing (lexopt smoothing) has several advantages. Because it provably
minimizes the variance of the required aggregate bandwidth, maximum resource requirements
within the network become more predictable, and useful resource utilization increases. Fairness in
sharing  a  network  link  by  multiple  users  can  be  improved,  and  new  requests  from  future  clients 
are more likely to be successfully admitted without the need for frequently rescheduling previously
accepted  traffic.  Efficient  resource  management  at  the  network  edges  can  better  meet  quality  of 
service  requirements  without  restricting  the  scalability  of  the  system. 
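
The ordering criterion can be made concrete with a short Python sketch. It assumes the usual definition, under which one schedule is lexicographically smaller than another if its vector of per-slot aggregate rates, sorted in decreasing order, comes first in lexicographic order. The sketch only compares two given schedules; it does not compute the optimal schedule, and the function names are hypothetical.

```python
from statistics import pvariance

def lex_key(rates):
    """Per-slot rates sorted from largest to smallest."""
    return sorted(rates, reverse=True)

def lex_smaller(rates_a, rates_b):
    """True if schedule A is lexicographically smaller (smoother) than B."""
    return lex_key(rates_a) < lex_key(rates_b)

# Two schedules that transmit the same total of 16 units over 4 slots.
smooth = [4, 4, 4, 4]
bursty = [6, 2, 5, 3]
```

Here `lex_smaller(smooth, bursty)` holds because the peak rate 4 is below the peak rate 6, and the smoother schedule also has the lower variance (0 versus 2.5).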

3.2.4  Space-Efficient  Indexes  for  Text  Databases 

The  proliferation  of  online  text,  such  as  on  the  World  Wide  Web  and  in  databases,  motivates  the 
need  for  space-efficient  text  indexing  methods  that  support  fast  string  searching.  In  this  scenario, 
consider a text T that is made up of n symbols drawn from a fixed alphabet $\Sigma$ and that is represented
in $n \log |\Sigma|$ bits by encoding each symbol with $\log |\Sigma|$ bits. The goal is to support quick search
queries  of  any  string  pattern  P  of  m  symbols,  with  T  being  fully  scanned  only  once,  namely,  when 
the  index  is  created. 

Text indexing schemes published in the literature are greedy in their use of space and require additional
$\Omega(n \log n)$ bits in the worst case. For example, suffix trees and suffix arrays need $\Omega(n)$ memory
words of $O(\log n)$ bits in the standard unit cost RAM. These indexes are larger than the text itself
by a factor of $\Omega(\log_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as ASCII or UNICODE.
On the other hand, they support fast searching either in $O(m \log |\Sigma|)$ time or in $O(m + \log n)$ time,
plus an output-sensitive cost $O(\mathit{occ})$ for listing the pattern occurrences.
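
For reference, a plain (uncompressed) suffix array and its search procedure can be sketched in a few lines of Python. This is a deliberately naive version under simplifying assumptions: construction sorts whole suffixes, and the search uses plain binary search with string comparisons, so it takes $O(m \log n)$ time rather than the $O(m + \log n)$ bound, which requires additional longest-common-prefix information. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, found by binary search on sa."""
    m = len(pattern)
    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

The array stores one position per text symbol, which is the source of the $n \log n$ bits of space discussed above.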

In [4], we present a new text index that is based upon new compressed representations of suffix
arrays and suffix trees. It achieves $O(m / \log_{|\Sigma|} n + \log_{|\Sigma|}^{\epsilon} n)$ search time in the worst case, for any
constant $0 < \epsilon \le 1$, with at most $(\epsilon^{-1} + O(1))\, n \log |\Sigma|$ bits of storage; that is, the index size is
comparable to the text size in the worst case. The above bounds improve both time and space of
previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown
factor in the output-sensitive cost, giving $O(\mathit{occ} \log_{|\Sigma|}^{\epsilon} n)$ time as a result. When the patterns are
sufficiently long, namely, for $m = \Omega((\log^{2+\epsilon} n)(\log_{|\Sigma|} \log n))$, we can use auxiliary data structures
in $O(n \log |\Sigma|)$ bits to obtain a total search bound of $O(m / \log_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.
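
To give a sense of scale, the following worked example compares the space terms for an illustrative text. The values $n = 2^{30}$, $|\Sigma| = 4$, and $\epsilon = 1/2$ are assumptions chosen for the arithmetic and do not come from [4]; logarithms are taken to base 2.

```latex
\begin{align*}
\text{text:} \quad
  & n \log |\Sigma| = 2^{30} \cdot 2 \ \text{bits} = 256\ \text{MiB}, \\
\text{suffix array:} \quad
  & n \log n = 2^{30} \cdot 30 \ \text{bits} = 3.75\ \text{GiB}
    \quad \bigl(\text{a factor of } \log_{|\Sigma|} n = 15 \text{ larger than the text}\bigr), \\
\text{leading term of the new bound:} \quad
  & \epsilon^{-1}\, n \log |\Sigma| = 2 \cdot 2^{31} \ \text{bits} = 512\ \text{MiB}.
\end{align*}
```

The $O(1) \cdot n \log |\Sigma|$ part of the bound adds a further constant multiple of the text size, which the asymptotic statement does not pin down.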

3.2.5  Entropy-Compressed  Text  Indexes 

In  [3]  we  continue  our  work  on  space-efficient  indexes  and  present  a  novel  implementation  of 
compressed  suffix  arrays  exhibiting  new  tradeoffs  between  search  time  and  space  occupancy  for 
a given text (or sequence) of n symbols over an alphabet $\Sigma$, where each symbol is encoded by
$\lg |\Sigma|$ bits. We show that compressed suffix arrays use just $n H_h + O(n \lg \lg n / \lg_{|\Sigma|} n)$ bits, while
retaining full text indexing functionalities, such as searching any pattern sequence of length m in
$O(m \lg |\Sigma| + \mathrm{polylog}(n))$ time. The term $H_h \le \lg |\Sigma|$ denotes the $h$th-order empirical entropy of the
text, which means that our index is nearly optimal in space apart from lower-order terms, achieving
asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is
highly compressible so that $H_n = o(1)$ and the alphabet size is small, we obtain a text index with
$o(m)$ search time that requires only $o(n)$ bits. We also report further results and tradeoffs on
high-order entropy-compressed text indexes.
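
The quantity $H_h$ can be computed directly for small inputs, as in the following Python sketch. It assumes the common definition in which $n H_h$ is the sum, over every context of $h$ symbols, of the zeroth-order entropy of the symbols that follow that context, weighted by their number. Conventions differ on whether the context precedes or follows a symbol, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(seq).values())

def empirical_entropy(text, h):
    """h-th order empirical entropy of text, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    if h == 0:
        return h0(text)
    # Group each symbol by the h symbols that precede it.
    followers = defaultdict(list)
    for i in range(h, n):
        followers[text[i - h:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n
```

For the text `"abab"`, the zeroth-order entropy is 1 bit per symbol, while the first-order entropy is 0 because each symbol is fully determined by its predecessor; this gap is what an entropy-compressed index exploits.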

4  Technology  Transfer 

We  plan  to  pursue  the  practical  applications  of  space-efficient  search  indexes.  Implementation  is 
ongoing. 

Our  efforts  are  also  having  an  impact  internationally.  We  have  had  discussions  about  the 
feasibility  of  adding  parallel  disk  capabilities  to  the  LEDA  project  at  Max  Planck  in  Saarbruecken, 
Germany. 